Abstract

Multiple-input multiple-output (MIMO) systems are among the most promising transmission
techniques to achieve high data rate and high reliability transmission over wireless channels. The recently
proposed Golden code is an optimal space-time block code for 2x2 MIMO systems. The aim of
this work is the design of a VLSI decoder for a MIMO system coded with the Golden code. The
architecture is based on a rearrangement of the sphere decoding algorithm that achieves
maximum-likelihood (ML) decoding performance. Compared to other approaches, the proposed solution exhibits
an inherent flexibility in terms of QAM modulation size, which makes the architecture particularly
suitable for adaptive modulation schemes. Relying on this flexibility, two different
architectures are proposed: a parametric one, able to achieve high decoding throughput (>165 Mbps)
while keeping the overall decoder complexity low (45 KGates) with respect to other proposed solutions; and a
flexible implementation, able to dynamically adapt to the modulation scheme (4-, 16-, 64-QAM) while retaining
the low complexity and high throughput features. In addition, a detailed analysis of finite precision effects
on the performance is presented for both 16- and 64-QAM.

Index Terms 

VLSI, digital architectures, Golden code, MIMO, sphere decoding

I. INTRODUCTION 

The hardware implementation of high data rate and high reliability wireless communication
systems is one of the most widely investigated topics within the scientific community and has
raised new engineering and research challenges for many years. Higher transmission reliability
demands higher levels of processing complexity in the mobile terminal, while faster data
rates require increased throughput: both trends are strong driving forces in the search
for novel efficient architectures implementing the most critical base-band processing functions.
In particular, new standards proposed to regulate Wireless Local Area Networks (WLAN) and
Metropolitan Area Networks (MAN) are significant examples of very challenging applications
from the implementation point of view.

There are two main objectives on which research is currently focused. The first goal is to
make wireless communication data rates comparable to those of wired communications: recent
results show it is possible to approach a 1 Gb/s data rate [19], [27]. The second one is to improve
reliability by combating multipath, noise and interference effects. The use of multiple-input
multiple-output (MIMO) systems seems to be one of the most promising solutions to reach both
of these goals.

Traditionally, MIMO systems were conceived with the purpose of dealing with one of these 
two objectives, by means of transmit antenna diversity combined with space-time coding. More 
recently great efforts have been made in unifying both goals and some new space-time codes are 
now able to reach the best tradeoff between data rate and diversity gain, although they require 
more sophisticated detection schemes at the receiver [1], [4], [6], [15], [21], [26]. 

The main contribution of this work is the hardware design of a decoder for this kind of
code, in particular for the decoding of a 2 x 2 MIMO signal coded with the Golden code [1].
The Golden code is a recently proposed full-rate and full-diversity space-time block code, chosen for
its good energy efficiency. The maximum-likelihood (ML) decoding algorithm for the Golden
code is based on the sphere decoder, which has already been widely addressed in the literature,
also from a hardware implementation point of view [16]-[28].

Several architectures have been proposed for the efficient implementation of the sphere
decoding algorithm, but they are optimized for specific modulation schemes and do not support
reconfigurability features. In [2, ASIC-I], high throughput is reached by using dedicated multipliers
and parallel computations in a so-called "one node per cycle" architecture. Other
architectures instead take advantage of suboptimal algorithms: good examples of this approach
are given in [2, ASIC-II], where the $L_\infty$-norm is implemented as an alternative to the more
expensive $L_2$-norm, and in [11], where the K-Best algorithm allows for performance-complexity
trade-offs. These choices lead to fully optimized architectures achieving high throughput;
however, they are not ML (the loss is about 1.4 dB in the case of [2, ASIC-II]) and have been proposed
for specific modulation and transmission schemes, although in [2] the possibility of adapting the
proposed solution to different modulations is also mentioned. In this work we overcome these
limitations by proposing two novel architectures, designed in VHDL as reusable intellectual
property (IP) macrocells. The first one is parametrized with respect to the fixed-point representation
of data and to the addressed modulation scheme; in order to enable comparisons with previous
implementations, synthesis results are provided for this architecture in the case of 16-QAM. The
second architecture is flexible, meaning that it can be dynamically configured to cover multiple
modulation schemes. We note that both hardware implementations can equivalently be
used in a 4 x 4 uncoded MIMO system [27].

In Section II we briefly explain the properties, construction and detection of the Golden code. Section
III is dedicated to reviewing the sphere decoding algorithm. In Section IV the effects of fixed-point
precision on the code performance are derived. A short introduction to the overall scheme
of the MIMO receiver is given in Section V, with particular attention to the QR decomposition
preprocessing unit; the detailed descriptions of the two hardware implementations are then carried
out in Sections VI and VII. In the last two sections results and conclusions are presented.

II. Golden code 

The Golden code is a space-time (ST) code for a 2 x 2 coherent MIMO channel; it was found
independently in [1], [6], [26].

Number theoretical methods have been widely employed to construct full-rate and full-diversity
codes for coherent MIMO systems. These methods are based on the rank and the determinant
criteria. In a Rayleigh fading channel the pairwise error probability (PWEP) expression [22]
shows that the error probability can be minimized by acting mainly on two aspects: diversity
and coding gain. In [22] it was proved that these parameters are related to the so-called codeword
difference matrix $D$, which is constructed as the difference between two codewords. In order to
maximize the diversity gain, the space-time code must be designed so that the difference matrix
between any two codewords is full rank (rank criterion). On the other hand, the coding gain
depends on the determinant of $DD^H$, and high coding gain is achieved by maximizing the minimum
of this determinant over all codeword pairs (determinant criterion).

The Golden code satisfies both the rank and the determinant criteria and, unlike
previously known codes, has the non-vanishing determinant property: its minimum
determinant is 1/5 and does not depend on the size of the signal constellation. For this reason
it can be successfully employed in systems with adaptive selection of the modulation.

Besides these properties, the Golden code is also energy efficient. It is
constructed using a rotated version of the $\mathbb{Z}[i]^2$ complex lattice, so that there is no loss due to
shaping [1].

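The codewords $X$ of the Golden code are 2 x 2 complex matrices of the following form:

$$ X = \frac{1}{\sqrt{5}} \begin{bmatrix} \alpha (a + b\theta) & \alpha (c + d\theta) \\ i\,\sigma(\alpha) (c + d\,\sigma(\theta)) & \sigma(\alpha) (a + b\,\sigma(\theta)) \end{bmatrix} \qquad (1) $$

where $a, b, c, d$ are the information symbols chosen in a $Q^2$-QAM = ($Q$-PAM)$^2$ constellation,
$i = \sqrt{-1}$, $\theta = (1 + \sqrt{5})/2 = 1.618\ldots$ (Golden number), $\sigma(\theta) = (1 - \sqrt{5})/2 = 1 - \theta$,
$\alpha = 1 + i - i\theta = 1 + i\,\sigma(\theta)$ and $\sigma(\alpha) = 1 + i - i\,\sigma(\theta) = 1 + i\theta$ [25].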
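
As an illustration of the codeword construction in (1), the following minimal Python sketch builds a Golden codeword from four symbols and evaluates its determinant. It assumes the usual $1/\sqrt{5}$ normalization, which is consistent with the minimum determinant of 1/5 quoted above; the function names `golden_codeword` and `det2` are hypothetical and not part of the original design.

```python
import math

THETA = (1 + math.sqrt(5)) / 2      # Golden number
THETA_BAR = 1 - THETA               # sigma(theta) = (1 - sqrt(5)) / 2
ALPHA = 1 + 1j * THETA_BAR          # alpha = 1 + i * sigma(theta)
ALPHA_BAR = 1 + 1j * THETA          # sigma(alpha) = 1 + i * theta


def golden_codeword(a, b, c, d):
    """Return the 2x2 Golden codeword (tuple of rows) for symbols a, b, c, d."""
    k = 1 / math.sqrt(5)            # assumed normalization factor
    return (
        (k * ALPHA * (a + b * THETA), k * ALPHA * (c + d * THETA)),
        (k * 1j * ALPHA_BAR * (c + d * THETA_BAR),
         k * ALPHA_BAR * (a + b * THETA_BAR)),
    )


def det2(x):
    """Determinant of a 2x2 matrix given as a tuple of rows."""
    return x[0][0] * x[1][1] - x[0][1] * x[1][0]
```

For instance, `abs(det2(golden_codeword(1, 0, 0, 0))) ** 2` evaluates to 1/5 up to rounding, and no nonzero integer symbol choice in a small range gives a smaller value, which matches the non-vanishing determinant property.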



A. The 2x2 MIMO System Model 

In order to model the 2 x 2 MIMO channel, its impulse response can be used. Denoting by $h_{ij}$
the time-varying channel fading coefficient between the $j$-th transmit antenna and the $i$-th
receive antenna, the MIMO channel is described through a 2 x 2 matrix:

$$ \mathcal{H} = \begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix} \qquad (2) $$

where $h_{ij} \sim \mathcal{N}_c(0, 1)$. Assuming the "block fading" channel model, each transmitted codeword
is affected by an independently varying channel matrix $\mathcal{H}$. The 2 x 2 received matrix
is then

$$ Y = \mathcal{H} X + Z $$

where $Z$ is the additive white Gaussian noise matrix with entries $\sim \mathcal{N}_c(0, N_0)$.

We note that each codeword is sent in two channel uses of the two transmit antennas, for
a total of four component signals. It is convenient to represent the codeword $X$ in vectorized
form where, furthermore, real and imaginary components are separated, resulting in an 8 x 1 real
vector $x$. The channel matrix $\mathcal{H}$ can consequently be rearranged into an 8 x 8 real-valued matrix
$H$. It can be seen that $x = Gs$, where $G$ is an 8 x 8 orthogonal matrix ($G^{-1} = G^T$) and
$s = (\Re a, \Im a, \Re b, \Im b, \Re c, \Im c, \Re d, \Im d)^T$ has entries from a $Q$-PAM constellation [25].
The vectorized system model can thus be expressed as:

$$ y = Hx + z = HGs + z \qquad (3) $$

where $y$ is the 8 x 1 received real vector and $z$ is an 8-dimensional i.i.d. (independent and
identically distributed) zero-mean Gaussian noise real vector.

B. Decoding the Golden code 

Decoding the Golden code is equivalent to decoding an 8-dimensional lattice with generator
matrix $M = HG$. Provided that $H$ is perfectly known at the receiver, the optimal detector for
a MIMO channel, which minimizes the codeword error rate, is the maximum-likelihood (ML)
detector. It solves the following problem:

$$ \hat{s} = \arg\min_{s \in \mathcal{Q}^n} \|y - Ms\|^2 \qquad (4) $$

where $\mathcal{Q}$ denotes the $Q$-PAM alphabet, so that $Q^n$ is the cardinality of the search space, and $n = 8$.

The above expression represents a discrete least-squares (LS) minimization problem. Exhaustive
search of the ML solution has exponential complexity: in this particular case there are $2^{n \log_2 Q} = Q^n$
possible solutions. Sphere decoding algorithms have therefore been proposed in order to decrease the
decoder complexity.
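
To make the cost of the exhaustive search concrete, the following Python sketch enumerates all $Q^n$ candidate vectors and keeps the one closest to the received vector. It is only a reference model under simple assumptions (matrices as lists of rows, PAM points taken as odd integers); the names `pam_alphabet`, `squared_distance` and `ml_exhaustive` are hypothetical.

```python
from itertools import product


def pam_alphabet(q):
    """Q-PAM points: the odd integers -(Q-1), ..., -1, 1, ..., Q-1."""
    return list(range(-(q - 1), q, 2))


def squared_distance(y, m, s):
    """Return ||y - M s||^2 for a matrix M given as a list of rows."""
    return sum((yi - sum(mij * sj for mij, sj in zip(row, s))) ** 2
               for yi, row in zip(y, m))


def ml_exhaustive(y, m, q):
    """Visit all Q^n candidates and return the closest one as a tuple."""
    n = len(m[0])
    return min(product(pam_alphabet(q), repeat=n),
               key=lambda s: squared_distance(y, m, s))
```

With $n = 8$ and 16-QAM ($Q = 4$) this loop already visits $4^8 = 65536$ candidates per codeword, and $8^8$ (about 16.8 million) for 64-QAM, which motivates the tree-based search.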

III. Sphere Decoding Algorithm

Sphere decoding algorithms denote a family of algorithms which aim at lowering the
complexity of the minimization (4) by analyzing only a subset of the solution space [5]. These
algorithms, in a certain range of parameters which is not too far from those of real systems,
show a polynomial average complexity. Although other work [16] disputes this theoretical result,
computer simulations still confirm it in practice. This behavior is due to the fact that $y$ is
not an arbitrary vector, but is given by the transmitted vector $Hx$ with a small offset due to
the additive noise $z$.

Sphere decoding algorithms look at the set of possible solutions as points of a lattice and try 
to find the closest point to the received vector. In particular, a hypersphere is constructed around 
the received vector and only points inside it are taken into account, since the others are actually 
too far. This constraint can be written as: 

$$ \|y - Ms\|^2 < C_0 \qquad (5) $$

where $C_0$ is the squared radius of the hypersphere [9], [23], [24]. In the following we describe a
method to easily compute distances between received signals and lattice points.

1) Tree construction: With a linear transformation of the matrix $M$, such as the QR or Cholesky
decomposition, it is possible to rewrite $M$ as a product of two matrices, one of which is upper
triangular [5]. In this work, the QR decomposition has been employed so that, given $M = QR$, (4)
can be rewritten as:

$$ \hat{s} = \arg\min_{s \in \mathcal{Q}^n} \|y - QRs\|^2 = \arg\min_{s \in \mathcal{Q}^n} \|Q^T y - Rs\|^2 = \arg\min_{s \in \mathcal{Q}^n} \|\tilde{y} - Rs\|^2 \qquad (6) $$

where we have exploited the orthogonality of $Q$, and $\tilde{y} = Q^T y$ represents the zero-forcing (ZF)
solution. The upper triangular structure of $R$ makes it possible to take every component
separately into account when computing the distance between the two points. The distance
$d^2(s) = \|\tilde{y} - Rs\|^2$ can also be computed recursively as follows. Let us define the partial metric
as in [2]

$$ T^{(l)}(s^{(l)}) = T^{(l+1)}(s^{(l+1)}) + |\psi_l^{(l+1)} - R_{ll} s_l|^2 \qquad (7) $$

where $s^{(l)} = (s_l, \ldots, s_n)$ collects the symbols decided from level $n$ down to level $l$,
$\psi_l^{(l+1)} = \tilde{y}_l - \sum_{j=l+1}^{n} R_{lj} s_j$, $T^{(n+1)} = 0$ and $\psi_n^{(n+1)} = \tilde{y}_n$. Then we can write $T^{(1)}(s) = d^2(s)$.
One of the most interesting consequences of this interpretation is that the exploration of the
lattice can be thought of as a tree traversal. This tree has $n$ levels and every node at each level
has $Q$ sons. At every level the radius constraint (5) must be checked; if it is not satisfied, the
branch is pruned. Figure 1 depicts a two-level tree for a QPSK modulation. $T^{(l)}$ is the partial
distance metric at level $l$ in (7); at the lowest level, final metrics are explicitly calculated for
this simple case.
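
The recursive distance computation can be sketched in Python as follows. This is a minimal illustration, assuming an upper triangular $R$ stored as a list of rows and 0-based indices (tree level $l$ corresponds to index $l - 1$); the name `partial_metrics` is hypothetical.

```python
def partial_metrics(y_tilde, r, s):
    """Return the partial metrics [T^(1), ..., T^(n)] for a full vector s.

    Each level adds the squared distance contributed by one row of R,
    starting from the last row (tree level n) with T^(n+1) = 0.
    """
    n = len(s)
    metrics = [0.0] * n
    t = 0.0                                    # T^(n+1) = 0
    for l in range(n - 1, -1, -1):             # from level n down to level 1
        # Interference of the symbols already fixed at the upper levels.
        interference = sum(r[l][j] * s[j] for j in range(l + 1, n))
        t += (y_tilde[l] - interference - r[l][l] * s[l]) ** 2
        metrics[l] = t
    return metrics
```

The first entry of the returned list equals $\|\tilde{y} - Rs\|^2$, and the metrics never decrease when moving down the tree, which is what makes pruning against a radius possible.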

Fig. 1. Two-level tree for QPSK modulation, where $a_1 = |\tilde{y}_2 + R_{22}|^2$, $a_2 = |\tilde{y}_2 - R_{22}|^2$,
$b_1 = |\tilde{y}_1 + R_{12} + R_{11}|^2$, $b_2 = |\tilde{y}_1 + R_{12} - R_{11}|^2$, $b_3 = |\tilde{y}_1 - R_{12} + R_{11}|^2$, $b_4 = |\tilde{y}_1 - R_{12} - R_{11}|^2$.

2) Tree exploration: Several algorithms have been studied in order to make the tree traversal
efficient. The first algorithm, proposed by Pohst in [9], requires an explicit choice of the initial radius.
This is a very critical choice: if the radius is too large, too many points fall into the
hypersphere, while if it is too small, no points are left inside it. A more efficient algorithm
has been proposed by Schnorr and Euchner (SE) [20]. The SE algorithm has intrinsically variable
throughput, which makes it not very suitable for hardware implementation. The key to making
this algorithm efficient or, at least, giving it a predictable throughput, is effective pruning.
Many theoretical studies in the recent literature aim at finding techniques to
reach this goal [28]. Although some of them give very interesting ideas, none of them currently seems to
combine effectiveness with a strong theoretical justification and a simple realization.
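
A software model of a depth-first search with Schnorr-Euchner ordering may help to fix ideas. The sketch below is a generic textbook-style formulation, not the hardware algorithm of this work: it assumes an infinite initial radius, sorts the sons of each node by increasing distance, and shrinks the squared radius every time a leaf is reached. The name `se_sphere_decode` is hypothetical.

```python
import math


def se_sphere_decode(y_tilde, r, q):
    """Depth-first search of arg min ||y_tilde - R s||^2 over Q-PAM^n.

    Returns (best vector, its metric, number of visited nodes).
    """
    n = len(y_tilde)
    alphabet = list(range(-(q - 1), q, 2))
    best = {"metric": math.inf, "s": None, "visited": 0}
    s = [0] * n

    def descend(level, metric):
        # psi: received component minus the interference of decided symbols.
        psi = y_tilde[level] - sum(r[level][j] * s[j]
                                   for j in range(level + 1, n))
        ordered = sorted(alphabet,
                         key=lambda c: abs(psi - r[level][level] * c))
        for cand in ordered:
            child = metric + (psi - r[level][level] * cand) ** 2
            if child >= best["metric"]:
                break                  # remaining siblings are farther: prune
            best["visited"] += 1
            s[level] = cand
            if level == 0:
                best["metric"], best["s"] = child, tuple(s)
            else:
                descend(level - 1, child)

    descend(n - 1, 0.0)
    return best["s"], best["metric"], best["visited"]
```

The number of visited nodes depends on the noise realization, which is the variable-throughput behavior mentioned above.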

IV. Fixed-point analysis 

The study of finite precision effects is a mandatory preliminary step in the design and hardware 
implementation of complex processing tasks. Although several implementations of the sphere 
decoding algorithm have been proposed, studies on finite precision effects are not available in 
literature. In this work, a wide range of simulations have been carried out in order to determine 
the effects of different fixed-point representations on the performance for both 16 and 64-QAM 
modulation schemes. 

The main conclusion that can be drawn from the results reported in Figure 2 is that the required
number of bits increases when higher-order modulations are used. There are two reasons for
this increase:

• with higher order modulations, Euclidean distances between constellation points decrease 
and a larger number of bits must be allocated in the fractional part to discriminate distances; 

• signal amplitudes are higher in higher order modulations, thus more bits need to be allocated 
also in the integer part. 

Simulation results show that a total of 12 bits leads to performance very close to the floating-point
case for 16-QAM modulation, while 14 bits are necessary for the detection of 64-QAM signals.

Fig. 2. System bit error rate (BER) using 16- and 64-QAM for a decreasing total number of bits.
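
The effect of a finite word length can be emulated in a floating-point simulation by quantizing every intermediate value. The sketch below shows one possible way to do it, assuming a signed two's-complement format with saturation; the split between integer and fractional bits is a free parameter here, since only the total word length is reported above, and the name `to_fixed_point` is hypothetical.

```python
def to_fixed_point(x, int_bits, frac_bits):
    """Quantize x to a signed fixed-point grid with saturation.

    The assumed word has 1 sign bit, int_bits integer bits and
    frac_bits fractional bits.
    """
    scale = 1 << frac_bits
    lo = -(1 << (int_bits + frac_bits))        # most negative code
    hi = (1 << (int_bits + frac_bits)) - 1     # most positive code
    code = max(lo, min(hi, round(x * scale)))  # round to nearest, saturate
    return code / scale
```

More fractional bits reduce the quantization step (needed to discriminate close distances), while more integer bits raise the saturation level (needed for larger signal amplitudes), mirroring the two reasons listed above.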



Finally, Figure 3 shows that the fixed-point approximation does not significantly affect the
number of nodes visited by the algorithm. The plot is given as a function of the codeword error
rate.

V. Preprocessing

In this section we discuss the implementation issues related to preprocessing, which is required
before the tree search. This computation operates on the lattice generator matrix $M = HG$;
since the code generator matrix $G$ is constant, the computation only needs to be repeated at the channel
estimation update frequency.

The update frequency for the channel estimation can change significantly according to the
scenario, but it is generally one or two orders of magnitude lower than the signalling rate.
Figure 4 depicts a block diagram of a MIMO system adopting the Golden code; dashed blocks
implement modulation and demodulation functions in a generic MIMO-OFDM system. The
Golden code decoding phase is made of three functions: QR decomposition, column reordering
and tree search.

Fig. 4. Golden Code MIMO System.



While column reordering is an optional operation able to reduce the tree-search complexity,
QR decomposition is mandatory because it allows constructing the tree and finding the ZF
solution. Possible techniques to perform the QR decomposition in hardware are therefore reviewed in
order to estimate the overall complexity of the receiver.

A. QR decomposition 

As already outlined, a linear transformation of the channel matrix H, such as the QR or Cholesky
decomposition, is needed in order to construct the tree.

QR decomposition is a well-studied numerical algorithm, widely used in many applications
such as matrix inversion, adaptive beamforming and filtering. QR-decomposition-based
recursive least squares (QRD-RLS) methods are routinely adopted in applications such as
multiuser detection in CDMA communications, adaptive equalization of radio channels, etc. The
method is well suited to VLSI realization and can be implemented in a stable manner using
relatively short word-length arithmetic.

Hardware realization of this technique implies a choice between algorithms based on the Householder
transformation and on Givens rotations [10]. The latter approach uses a
sequence of rotations to annihilate the elements under the main diagonal of the matrix.
Givens rotations require a larger number of flops than the Householder method to
compute the QR decomposition; nevertheless, they can be implemented using highly parallel systolic
arrays and for this reason they are usually preferred for hardware implementation.
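
As a software reference for the Givens-based approach, the following Python sketch triangularizes a square matrix by rotating pairs of adjacent rows. It is a sequential model only, with no relation to a specific systolic array organization; the name `givens_qr` is hypothetical.

```python
import math


def givens_qr(m):
    """Return (Q^T, R) such that Q^T M = R is upper triangular."""
    n = len(m)
    r = [list(map(float, row)) for row in m]
    qt = [[float(i == j) for j in range(n)] for i in range(n)]
    for col in range(n):
        for row in range(n - 1, col, -1):      # annihilate r[row][col]
            a, b = r[row - 1][col], r[row][col]
            norm = math.hypot(a, b)
            if norm == 0.0:
                continue                       # nothing to rotate
            c, s = a / norm, b / norm          # cosine and sine of the angle
            for mat in (r, qt):                # same rotation on R and Q^T
                for k in range(n):
                    up, low = mat[row - 1][k], mat[row][k]
                    mat[row - 1][k] = c * up + s * low
                    mat[row][k] = -s * up + c * low
    return qt, r
```

Applying the returned `qt` to the received vector gives the rotated vector $Q^T y$ used by the tree search, so the matrix $Q$ never has to be formed explicitly.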

These arrays typically present linear, triangular, or square structure; the rotation angle is
computed in boundary or diagonal processors and dispatched to other processors for rotation.
The choice of the organization can be made on the basis of area and throughput considerations.

TABLE I
QR DECOMPOSITION: DIFFERENT ARRAY ORGANIZATION PARAMETERS - NUMBER OF PROCESSING ELEMENTS (PE),
LATENCY AND THROUGHPUT

Architecture   | # of PEs | Latency of single QR   | Throughput
---------------+----------+------------------------+--------------------------
Triangular     | n(n+1)/2 | n(n+1)/2               | 1/n
               |          | 2n^2 - n               | 1/(2n^2 - n)
Linear         | n        | (2n-1) + (f-1)(n+1)    | 1/[(2n-1) + (f-1)(n+1)]
Single Element | 1        | n^2(n+1)/2             | 1/[n^2(n+1)/2]


The main parameters of these architectures are listed in Table I for an n x n matrix: number of
processing elements (PE), latency and throughput. It is assumed that every processing element
takes one or more clock cycles to perform its computation.

Every single processing element must perform the angle calculation and the rotation to 
annihilate the matrix elements. Several alternatives exist to accomplish these two tasks, and 
the two main ones are: 

1) Sine and cosine of the angle are computed by means of operations including also square 
root and division. 

2) Direct calculation of the angle and then rotation using a CORDIC processor [12]. 

The main advantage of the first approach is that primitives can be optimized resulting in 
an efficient although expensive implementation. The second technique is less expensive, but 
outputs are generated with longer latencies and data-dependency between operations slows down 
the CORDIC algorithm. Many strategies have been adopted in order to alleviate the effects of 
data-dependencies, such as reordering look-ahead [3], [14], [17] or redundant arithmetic [8]. 
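
The CORDIC option can be illustrated with a small floating-point model of the vectoring mode, which drives the second coordinate to zero while accumulating the rotation angle. This is a generic sketch of the well-known algorithm, assuming a positive first coordinate and a fixed number of iterations; it does not model the look-ahead or redundant-arithmetic variants cited above, and the name `cordic_vectoring` is hypothetical.

```python
import math


def cordic_vectoring(x, y, iterations=16):
    """Rotate (x, y), with x > 0, onto the x axis with shift-and-add steps.

    Returns (magnitude, angle). Step i rotates by atan(2^-i), so the
    multiplications by 2^-i become plain shifts in hardware.
    """
    angle = 0.0
    for i in range(iterations):
        d = -1.0 if y > 0 else 1.0             # rotate towards y = 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        angle -= d * math.atan(2.0 ** -i)
    # Each micro-rotation stretches the vector; compensate the total gain.
    gain = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(iterations))
    return x / gain, angle
```

Each iteration depends on the sign of the previous result, which is the data dependency that limits the speed of this approach.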

For lower data-rates, architectures that reuse the processing elements on different data have 
been proposed in [7], [13]. These architectures represent probably the best tradeoff for the 
applications addressed in this work. 



VI. First Hardware Implementation: Parametrizable Solution

The tree-search algorithm is considered as the most computationally intensive processing 
block in a MIMO detector, although column reordering and QR decomposition can also be 
heavy processing tasks. However, since the rate of updating for channel estimation is usually 
one or two orders of magnitude lower than the signalling rate, design constraints tend to be 
more stringent for the tree-search unit than for column reordering and the QR decomposition. 
Thus the focus of this work is on the hardware realization of the tree-search algorithm. 

As guidelines for the design of the architecture, two main objectives have been taken into
account. The first requirement was a certain degree of flexibility in the choice of both the modulation
scheme and the fixed-point data representation. The second main design objective was a high decoding
throughput, compliant with the needs of modern wireless communication standards.

In the developed architecture, the datapath width, the size of the search tree and the modulation 
scheme are tunable parameters that can be statically configured to make the detector adaptable 
to different systems. Although the system is described with reference to the special case of the 
Golden code, it can be also used to decode a 4 x 4 uncoded MIMO scheme. The key elements 
of the developed architecture are described in the following paragraphs. 

A. A flexible hardware solution 

The key processing task in the tree exploration algorithm is given by (7), where we recall that
$\psi_l^{(l+1)} = \tilde{y}_l - \sum_{j=l+1}^{n} R_{lj} s_j$ is the $l$-th entry of an $n$-element vector $\psi^{(l+1)}$, where $l + 1$ is the
tree level we are referring to. At level $l$, the generic $i$-th entry of this vector can be computed
recursively through the following expression:

$$ \psi_i^{(l)} = \begin{cases} \tilde{y}_i & \text{if } l = n + 1 \\ \psi_i^{(l+1)} - R_{il} s_l & \text{if } l = n, \ldots, 1 \end{cases} \qquad (9) $$

where $i$ is in the range $1, \ldots, n$ while the level $l$ decreases from $n + 1$ to 1.
The whole vector $\psi^{(l)}$ can therefore be updated by means of

$$ \psi^{(l)} = \psi^{(l+1)} - R_l s_l, \qquad l = n, \ldots, 1 \qquad (10) $$

where $R_l$ is the $l$-th column of $R$ and the initial value is given by $\psi^{(n+1)} = \tilde{y}$.
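
The column-wise update (10) amounts to one multiply-subtract per vector entry, as the following Python sketch shows. Indices are 0-based (tree level $l$ corresponds to index $l - 1$) and the name `psi_update` is hypothetical.

```python
def psi_update(psi_next, r, level, s_l):
    """Return psi^(l) = psi^(l+1) - R_l * s_l, with R_l the l-th column of R.

    psi_next is the vector inherited from the upper tree level and s_l is
    the symbol selected at the current level.
    """
    return [p - row[level] * s_l for p, row in zip(psi_next, r)]
```

Starting from the rotated received vector and applying the update once per level, the entry needed at each level is already free of the interference of the symbols decided above it, so no summation has to be recomputed.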
In order to minimize the final metric $d^2(s)$ with a greedy algorithm, at each level of the tree
the son with the minimum $|\psi_l^{(l+1)} - R_{ll} s_l|$ value must be selected. More precisely, at each
tree node, placed at level $l$, three main operations have to be accomplished:

1) the $s_l$ that minimizes the difference $|\psi_l^{(l+1)} - R_{ll} s_l|$ is selected;

2) the partial metric $T^{(l)}(s^{(l)})$ is calculated according to (7);

3) for each $i = 1, \ldots, n$, $\psi_i^{(l)}$ is evaluated for the selected $s_l$ value, according to (9).

Thus the straightforward minimization of the partial metrics $T^{(l)}(s^{(l)})$ requires computing the difference
for all the possible values of $s_l$. This technique becomes increasingly expensive with high-order
modulations, due to the large number of required operations.

In the proposed architecture, the minimization of $T^{(l)}(s^{(l)})$ is rearranged in two steps. In the
first processing step, the value of $s_l$ that minimizes the difference $|\psi_l^{(l+1)} - R_{ll} s_l|$ is directly
selected by means of a division; the obtained $s_l$ is then used to generate the $\psi_i^{(l)}$ values through (9)
for all $i = 1, \ldots, n$. In the second step, (7) is finally evaluated to obtain the actual metric value
$T^{(l)}$ for the selected son. Two functional blocks, the U_psi and Metric_compute units, are allocated
to perform these processing steps.



In order to find the value of $s_l$ that minimizes $|\psi_l^{(l+1)} - R_{ll} s_l|$, the U_psi unit (shown in Figure 5)
receives as inputs the $\psi_l^{(l+1)}$ value derived at the upper tree level, together with the $l$-th diagonal element
of matrix $R$. The result of the division $\psi_l^{(l+1)} / R_{ll}$ is approximated to the closest odd integer.
This approximation is equivalent to the selection of the closest point in a $Q$-PAM constellation.
The resulting value directly provides the desired $s_l$ for the analyzed node. The new $\psi_i^{(l)}$ values
are then evaluated in parallel, to be used at the lower tree level.
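
Rounding the division result to the closest odd integer can be written in one line, as in the following Python sketch. Saturation at the outermost constellation points is added here as an assumption, since the closest point of a finite $Q$-PAM constellation cannot lie outside it; the name `closest_pam` is hypothetical.

```python
import math


def closest_pam(ratio, q):
    """Nearest point to `ratio` among the odd integers -(Q-1), ..., Q-1."""
    s = 2 * math.floor(ratio / 2) + 1          # nearest odd integer
    return max(-(q - 1), min(q - 1, s))        # saturate at the outer points
```

The cost of this selection does not depend on $Q$, whereas comparing the distances of all sons requires $Q$ evaluations per node.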

Vector $\psi^{(l)}$ is stored in a dedicated memory, which will be later referred to as Psi memory
in the global architecture given in Figure 10.

The $\Delta$ output in Figure 5 is defined as

$$ \Delta = s_l - \frac{\psi_l^{(l+1)}}{R_{ll}} $$

and it represents the correction term to be applied to the division result in order to obtain the
closest point in the equivalent PAM constellation. The use of $\Delta$ will be described later in this
Section.

Fig. 5. U_psi Unit datapath



The Metric_compute unit realizes the second processing step, evaluating the new metric $T^{(l)}$
for the selected son. Figure 6 shows the block architecture: from the upper tree level, $T^{(l+1)}$
is received as input, together with the value generated by the U_psi unit; the obtained $T^{(l)}$ is
propagated to the lower tree level.

The described approach, and particularly the use of a division to obtain the optimal $s_l$,
avoids multiple metric computations; thus it offers low complexity and, at the same time,
flexibility in terms of supported modulation schemes. As a matter of fact, a parallel architecture
tailored to a given search tree is able to achieve high processing speed, but it is tied to that tree.
Similarly to what is done in a software implementation, the sequential
computation of a single metric at each cycle allows the same processing platform to adapt to different
structures of the search tree by simply varying the number of search steps in the tree, thus
providing support for multiple modulation schemes.

On the other hand, unlike in previous detectors,
multiplications cannot be reduced to add and shift procedures since the operands are not fixed; as a
consequence, general purpose multipliers have been allocated.

It is worth noting that, although the described technique introduces the division $\psi_l^{(l+1)} / R_{ll}$, only
a few values of this ratio are of interest for the algorithm, namely those that correspond to the equivalent
PAM constellation points $\pm 1, \pm 3, \ldots$. As a consequence, a general purpose hardware divider is
not necessary and the required operation can be executed by means of a simplified component
able only to find the closest integer solution of this division and to determine whether the approximation
is by defect or by excess: the first $\log_2 Q$ steps of a successive subtraction divider [18] can be
employed for this purpose, where $Q^2$ is the number of signals in the QAM constellation. This
divider has a very simple architecture that employs only shifts and subtractions; although it
tends to be very slow for a complete division, this solution can be effectively used when only
a few shift and subtract steps are required. The divider employs a dichotomous process to find the
requested value after $\log_2 Q$ steps. In the block diagram of Figure 7 the multiplexer selects
the dividend at the first step and the subtraction result in the following ones; the n-bit variable
shifter is used to shift the divisor by a number of positions that changes from the initial value
of $\log_2 Q - 1$ down to 0. The subtractor returns the result one bit per iteration, starting from the
most significant one.

Fig. 7. Block diagram of the divider
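
One possible reading of the truncated successive-subtraction divider is sketched below in Python. The operand offset, the saturation and the integer (fixed-point) operands are assumptions made for this illustration and are not taken from the block diagram; the sketch also assumes a positive diagonal element and a power-of-two $Q$. The name `pam_divide` is hypothetical.

```python
def pam_divide(psi, r_ll, q):
    """Find the Q-PAM point closest to psi / r_ll in log2(Q) subtract steps.

    Returns (s, rounded_up), where rounded_up is True when s >= psi / r_ll,
    i.e. when the approximation is by excess.
    """
    steps = q.bit_length() - 1                 # log2(Q) for a power of two
    divisor = 2 * r_ll                         # PAM points are 2 apart
    # Offset so that the quotient k in 0..Q-1 indexes s = 2k - (Q - 1).
    rem = psi + q * r_ll
    rem = max(0, min(q * divisor - 1, rem))    # saturate outside the range
    k = 0
    for shift in range(steps - 1, -1, -1):     # most significant bit first
        trial = divisor << shift
        bit = rem >= trial
        if bit:
            rem -= trial
        k = (k << 1) | bit
    return 2 * k - (q - 1), rem <= r_ll
```

The loop body uses only a shift, a comparison and a subtraction, and the final remainder indicates on which side of the true ratio the selected point lies.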

B. Parallelism and pipelining 

The desired functional flexibility cannot be achieved at the expense of processing throughput:
the final architecture must combine flexibility and high data
rates. Among the effective techniques that can be used to increase throughput, parallelism and
pipelining have been considered. In previous works, high throughput is obtained by resorting to
parallel architectures, and two different kinds of parallelism are usually employed:

• Parallelism at the level of tree exploration 

• Parallelism at the level of the metric computation for all sons of a given node and in the 
selection of the most probable son. 

The first technique can be used only with some suboptimal algorithms [28] and it becomes 
unfeasible when optimal algorithms are adopted, since it requires large amounts of hardware 
resources. The second approach is feasible only with low order QAM modulation schemes 
as it implies many concurrent multiplications. Thus these techniques are not viable for the 
implementation of parametric architectures. As a consequence, in this work, the pipelining 
technique has been investigated. 

In order to ensure that a new node is expanded at each clock cycle, a new, alternative metric
must be available also after a pruning operation has taken place. As a consequence, when the
metrics of a given father node are evaluated, two "candidate" nodes are concurrently computed:
the first one is a direct son of the current node and it is processed by the U_psi unit, while the
alternative node, placed at a higher level in the tree, is concurrently computed by the U_psi_step
sub-circuit (see Figure 8). Both of them generate new values for the next step in the tree
traversal.

Fig. 8. Architecture of the U_psi_step unit

The U_psi and U_psi_step units share a very similar architecture; however, the latter does not need
to perform the division, as the second best choice for $s_l$ (and thus for the alternative node) can
be easily derived as follows. When the U_psi unit computes the division, the result is approximated
either by defect or by excess to the nearest PAM constellation point: the best choice for $s_l$ is
given by (see Figure 9)

$$ s_l^{(1)} = \frac{\psi_l^{(l+1)}}{R_{ll}} + \Delta \qquad (11) $$

where $\Delta$ is the correction term provided as output by the U_psi unit (Figure 5).

The sign of $\Delta$ is used by the U_psi_step unit to take the second (and following) nearest point in
the PAM constellation, according to the following rule, implemented in the top block of Figure 8:

$$ s_l^{(k)} = s_l^{(k-1)} - (-1)^{k}\, \mathrm{sign}(\Delta)\, (k - 1)\, \delta \qquad (12) $$

where $\delta$ is the distance between two consecutive points and the initial value, $s_l^{(1)}$, is the closest
point given in equation (11).
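
The zig-zag enumeration of the alternative points can be sketched in Python as follows. The sign convention is an assumption: the correction term is taken as the selected point minus the exact ratio, so a positive value means that the rounding was by excess and the second nearest point lies below. Constellation boundaries are ignored, and the name `se_candidates` is hypothetical.

```python
import math


def se_candidates(ratio, count, spacing=2):
    """Return `count` odd integers in order of increasing distance from ratio."""
    s = 2 * math.floor(ratio / 2) + 1          # closest point, as in (11)
    delta = s - ratio                          # correction term
    sign = 1 if delta > 0 else -1
    out = [s]
    for k in range(2, count + 1):              # alternating steps, as in (12)
        s = s - (-1) ** k * sign * (k - 1) * spacing
        out.append(s)
    return out
```

For a ratio of 0.3 the sequence is 1, -1, 3, -3, whose distances from the ratio (0.7, 1.3, 2.7, 3.3) increase monotonically; only the sign of the correction term and a step counter are needed, with no further division.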

Figure 9 shows the sequence of alternative nodes selected at a given tree level after the
occurrence of pruning. Depending on the value assumed by the father node metric, the algorithm
descends along the tree, reaching the son node, or moves to the alternative node on the same
level. It is worth noting that the computations of the $\psi^{(l)}$ values for both son and alternative
nodes are performed concurrently with the evaluation of the $T^{(l)}$ metric for the father node. In
other words, while the current metric is computed for the father node, the next node to be visited
is identified by choosing between the son and the alternative node. Additionally, the related $\psi^{(l)}$
value is computed to be used at the following step in order to obtain the proper metric.

This approach also provides a significant speed-up to the inherently serial SE Sphere Decoding 
algorithm and has a limited impact on complexity. 

C. Global architecture 

The block scheme of the SE tree-traversal circuit, showing the architecture derived from
the design criteria outlined in the previous paragraphs, is depicted in Figure 10. Four
fundamental processing blocks can be identified in this architecture:

• U_psi unit, which selects the most probable son of the current node and computes the updated
$\psi^{(i)}$ through expression (10) (see also Figure 5);

• U_psi_step unit, which selects the alternative node to be expanded and computes the same
quantity for this node;

• Metric_compute unit, which computes the metric of the current node,
$T^{(i)} = T^{(i+1)}(s^{(i+1)}) + |\psi_i^{(i+1)} - R_{ii} s_i|^2$, as in equation (7);

• C.U., the control unit devoted to the proper selection of the tree search direction.

The C.U. constitutes the core of the tree traversal algorithm and must also carry out two
further tasks: to verify the pruning condition and, on the basis of this verification, to properly
dispatch data between the other units. The symbols given in Figure 10 relate to the case of
a node expanded in the depth-first mode, with no pruning: as a consequence, the inputs of the
Metric_compute unit are fed with the outputs provided by the U_psi block. When pruning occurs,
the multiplexers are switched and the metrics related to the alternative node are selected.
Finally, the Psi Memory stores the $\psi^{(i)}$ vectors from one step to the following one.







Fig. 10. Sphere decoder block scheme (case of a node expanded in the depth-first mode, with no pruning). 
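The interplay between descent to the son node, pruning and the move to the alternative node can be mimicked by a short software model. The Python sketch below is a behavioural illustration under stated assumptions, not a description of the circuit. It assumes a real-valued system with an upper-triangular matrix `R`, a 4-PAM alphabet, and a hypothetical function name `se_sphere_decode`. It also recomputes the interference-cancelled value at every node instead of reading it from a Psi Memory, and it orders the candidates by sorting rather than with a U_psi / U_psi_step pair.

```python
PAM = (-3, -1, 1, 3)  # assumed 4-PAM alphabet (one real dimension of 16-QAM)


def se_sphere_decode(R, y, alphabet=PAM):
    """Depth-first search with closest-first ordering and radius pruning.

    R: upper-triangular n x n matrix (list of lists), y: list of n reals.
    Returns (symbols, metric) of the minimum-distance candidate.
    """
    n = len(y)
    best = {"metric": float("inf"), "s": None}
    s = [0] * n

    def visit(i, parent_metric):
        # Received value at level i after cancelling the decided symbols.
        psi = y[i] - sum(R[i][j] * s[j] for j in range(i + 1, n))
        # Closest candidate first (son node), then the alternatives.
        for cand in sorted(alphabet, key=lambda a: abs(psi - R[i][i] * a)):
            metric = parent_metric + (psi - R[i][i] * cand) ** 2
            if metric >= best["metric"]:
                # Pruning: remaining candidates at this level are farther still.
                break
            s[i] = cand
            if i == 0:
                # Leaf reached: shrink the search radius.
                best["metric"], best["s"] = metric, list(s)
            else:
                visit(i - 1, metric)

    visit(n - 1, 0.0)
    return best["s"], best["metric"]
```

Because the candidates at each level are visited in order of increasing partial metric, the first pruned candidate allows the whole remainder of the level to be skipped, which is the property the control unit relies on.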



VII. Second Hardware Implementation: Flexible Modulation Solution 

The capability of managing more than one modulation scheme, in order to adaptively select the
most efficient one according to user needs and channel conditions, is one of the most important
requirements of modern wireless communication systems. The Golden code, thanks to the
non-vanishing determinant property, is very well suited for such applications, since it achieves
the best performance independently of the QAM size. In order to take full advantage of this Golden
code feature, an enhanced implementation has been realized to allow run-time choice of the
modulation scheme.

This implementation relies on the same architecture described in the previous section, with an
additional parameter that allows the run-time selection of the constellation. The requirement of
supporting multiple modulation schemes mainly affects the control logic, while the other
architecture components remain the same as in the first hardware implementation.

At each level of the tree, the C.U., besides verifying the pruning condition, also carries out
a second verification task related to the mapping constraint: it verifies whether a certain value
of $s_i$ still belongs to the specified constellation and uses this information to drive the
processing.

This mapping constraint must also be taken into account in the division $\psi_i^{(i+1)}/R_{ii}$.
As the number of acceptable values for this operation depends on the adopted modulation, the
constellation parameter is used to dynamically drive the iterations of the dichotomic division
algorithm.
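One possible reading of a constellation-constrained dichotomic division is a bisection over the PAM alphabet, in which the quotient is never formed explicitly: the dividend is only compared against the divisor scaled by a decision threshold. The Python sketch below illustrates that idea under assumptions that are not taken from the paper: a positive divisor `r`, an odd-integer alphabet whose size is a power of two, and the hypothetical function name `dichotomic_nearest_pam`.

```python
def dichotomic_nearest_pam(psi, r, pam_size):
    """Return (point, iterations): the PAM point nearest to psi / r.

    Assumptions: r > 0, pam_size is a power of two, and the alphabet is
    {-(M-1), ..., -1, +1, ..., +(M-1)} with M = pam_size.
    """
    lo, hi = -(pam_size - 1), pam_size - 1  # current interval of candidates
    iterations = 0
    while lo < hi:
        # Even-valued decision threshold halfway between the two halves.
        mid = (lo + hi) // 2
        # Compare psi with r * mid instead of dividing psi by r.
        if psi >= r * mid:
            lo = mid + 1
        else:
            hi = mid - 1
        iterations += 1
    return lo, iterations
```

In this sketch the number of iterations equals log2 of the PAM size: one for 4-QAM, two for 16-QAM and three for 64-QAM. The result also saturates to the outermost point when the quotient falls outside the constellation, so the mapping constraint is satisfied by construction.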

Although the architecture targets the Golden code, for which n = 8, it is also scalable in
terms of n: when the number of transmitting and receiving antennas increases, a larger value of
the parameter n can be set in the VHDL code to synthesize detectors for larger STB codes. Of
course, a larger n implies a more expensive architecture; in particular, the value of n mainly
affects:

• the number of $\psi$ values to be evaluated in parallel in Figures 6 and 9;

• the depth of the tree;

• the size of the $\psi$ memory.

The complexity of the processing blocks in Figures 5 and 8 grows almost linearly with n; the
memory size increases as $n^2$, because n values of $\psi$ have to be stored for each of the n
tree levels. Finally, the throughput is expected to decrease with n, since the number of visited
nodes grows, but this effect also depends strongly on the supported code.

VIII. Synthesis results 

The first proposed architecture, tailored to process the 16-QAM case, has been synthesized
on both 0.13 μm and 0.25 μm CMOS standard-cell technologies, using Synopsys version
Z-2007.03-SP1; synthesis on the 0.13 μm technology has been performed for the second, flexible
architecture. A commercial low-power library has been chosen.

TABLE II

Synthesis results and comparisons (16 bits)

                                                              This work
Reference        ASIC-I [2]     ASIC-II [2]    [11]           Parametrizable  Parametrizable  Flexible
                                                              impl.           impl.           impl.
Antennas         4x4            4x4            4x4            2x2 per two     2x2 per two     2x2 per two
                                                              channel uses    channel uses    channel uses
Modulation       16-QAM         16-QAM         16-QAM         16-QAM          16-QAM          4,16,64-QAM
Detector         depth-first    depth-first    K-best         depth-first     depth-first     depth-first
                 sphere         sphere         sphere         sphere          sphere          sphere
BER perf.        ML             Close to ML    Close to ML    ML              ML              ML
Tech. [μm]       0.25           0.25           0.35           0.25            0.13            0.13
Core area [GE]   117K           50K            91K            56K             45K             55K
                 +preproc.      +preproc.      +preproc.      +preproc.       +preproc.       +preproc.
Max. clock       51 MHz         71 MHz         100 MHz        109 MHz         250 MHz         217 MHz
Throughput       73 Mbps        169 Mbps       52 Mbps        73 Mbps         167 Mbps        146 Mbps
                 @SNR=20 dB     @SNR=20 dB                    @SNR=20 dB      @SNR=20 dB      @SNR=20 dB

In order to enable a direct comparison with existing hardware realizations (ASIC-I and ASIC-II
in [2], and [11]), a 16-bit datapath has been chosen. The overall decoder has also been simulated
with the uncoded 4x4 MIMO system, and the throughput figures reported in Table II refer to this
configuration.

The comparison of the described architectures with existing implementations tends to be quite
difficult to carry out, because different approaches have been adopted: in particular, our
solution implements the ML detection algorithm by means of a serial architecture, while the first
ASIC in [2] maps the same algorithm onto a parallel structure and the second ASIC in [2] makes
use of a serial scheme to realize a close-to-ML algorithm. These differences must be carefully
evaluated while reading the results in Table II.
TABLE III

Different datapath width synthesis results

DP width   Area [kG]   Period [ns]   Freq. [MHz]   Through. [Mbps]
12         41          4.3           232           155 (16-QAM)
14         47          4.45          224           150 (16-QAM)
16         55          4.6           217           146 (16-QAM)

Comparing the parameterizable architecture with the parallel implementations in Table II (the
solution described in [11] and the first ASIC presented in [2]), it can be observed that a single
metric computation is performed at each cycle, instead of multiple parallel metric computations.
This characteristic justifies both the reduced complexity and the inherent flexibility of the
proposed architecture. At the same time, thanks to the adopted pipelined architecture, a
remarkable average decoding throughput is achieved without any highly specialized structure.
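Since one metric is computed per clock cycle, the clock and throughput figures of Tables II and III can be combined into an average number of clock cycles spent per decoded vector. The short Python sketch below performs this arithmetic. The assumption that each decoded vector carries 16 bits (four transmit antennas with 16-QAM, as in the uncoded 4x4 configuration used for the throughput figures) and the names `RESULTS` and `cycles_per_vector` are introduced here for illustration only.

```python
# (max clock in MHz, throughput in Mbps), copied from Tables II and III.
RESULTS = {
    "parametrizable, 0.25 um": (109, 73),
    "parametrizable, 0.13 um": (250, 167),
    "flexible, 0.13 um": (217, 146),
}

BITS_PER_VECTOR = 16  # assumption: 4 antennas x 4 bits per 16-QAM symbol


def cycles_per_vector(clock_mhz, throughput_mbps, bits=BITS_PER_VECTOR):
    """Average clock cycles per decoded vector implied by the two figures."""
    return bits * clock_mhz / throughput_mbps


for name, (clock, rate) in RESULTS.items():
    print(f"{name}: {cycles_per_vector(clock, rate):.1f} cycles per vector")
```

Under this assumption the three implementations all come out at roughly 24 cycles per vector at SNR = 20 dB, which suggests that the throughput differences among them follow the clock frequency rather than the search effort.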

Implementation cost is slightly higher than for the second ASIC proposed in [2], where a 
serial approach is also adopted, in conjunction with a close to ML algorithmic approach. 

On the other hand, the flexible implementation in the last column of Table II proves the limited
complexity and performance overhead associated with the capability of dynamically adapting to
different modulations (4-, 16- and 64-QAM).

Finally, the results presented in Section IV on the finite precision analysis of the decoding
algorithm have been exploited to derive additional post-synthesis figures for the flexible
architecture: these results, referring to different datapath widths, are given in Table III. A
total of 14 bits is enough for the 64-QAM modulation (6 bits for the integer part and 8 for the
fractional one), and the two saved bits grant a complexity reduction of 8 Kgates.
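The effect of the 6 + 8 bit split can be illustrated with a small quantization model. The Python sketch below is an assumption-laden illustration rather than the data format actually used in the design: it assumes a two's complement representation in which the sign bit is counted among the integer bits, rounding to the nearest step and saturation on overflow. The function name `quantize` is hypothetical.

```python
def quantize(x, int_bits=6, frac_bits=8):
    """Round x onto a signed fixed-point grid, saturating on overflow.

    Assumption: two's complement, with the sign bit counted in int_bits,
    so 6 + 8 bits cover [-32, 32) in steps of 2**-8.
    """
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = round(x * scale)                    # nearest representable code
    code = max(lowest, min(highest, code))     # saturate instead of wrapping
    return code / scale
```

With these assumptions the quantization step is 2^-8 (about 0.0039) and values beyond the representable range are clipped, so a simulation can feed `quantize` outputs to the decoder model to compare its error rate against a floating-point reference.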

IX. Conclusions 

A novel approach has been presented for the hardware implementation of a Sphere Decoder
detector: the proposed solution uses a single metric computation per cycle and is well suited for
pipelining, breaking the sequential nature of the SD algorithm.

The main element of novelty of the described approach is its inherent flexibility, which makes
it suitable for the implementation of an adaptive modulation scheme. Two different hardware
architectures have been designed: the first implementation is a parametrizable one, while the
second is able to adapt on the fly to different modulation schemes.

The data representation format adopted in both implementations is based on an exhaustive
analysis of finite precision effects carried out for 16- and 64-QAM modulations.

Final synthesis results of the proposed architectures are listed in Table II and show a
significant complexity reduction (approx. 50% for 16-QAM modulation) with respect to parallel
structures. This is mainly due to the single metric computation per cycle. A remarkable average
decoding throughput can be achieved with both implementations, thanks to the pipelining
technique, even though the hardware is not tailored to a single modulation scheme, as all
previously proposed solutions are.